Group 17

Assignment 2. Multidimensional scaling of a high-dimensional dataset.

The data set baseball-2016.xlsx contains information about the scores of baseball teams in the USA in 2016, such as:

Games won, Games lost, Runs per game, At bats, Runs, Hits, Doubles, Triples, Home runs, Runs batted in, Bases stolen, Times caught stealing, Bases on Balls, Strikeouts, Hits/At Bats, On Base Percentage, Slugging percentage, On base+Slugging, Total bases, Double plays grounded into, Times hit by pitch, Sacrifice hits, Sacrifice flies, Intentional base on balls, and Runners Left On Base.

  1. Load the file to R and answer whether it is reasonable to scale these data in order to perform a multidimensional scaling (MDS).

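The dataset consists of 30 unique baseball teams, along with statistics for each team and the league it belongs to (NL or AL). In this case it would be reasonable to scale (MDS), that is, to reduce the dimensionality of the vectors, in order to get a more digestible dataset in which we can compare the teams and see how close they are to each other. Because the variables are measured on very different scales (for example, At bats is in the thousands while Hits/At Bats is below 1), standardizing them before computing distances also seems reasonable, so that no single variable dominates.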
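
A minimal sketch of the loading and standardizing step is given here. It is not necessarily the code used for this report: the helper name `prepare_numeric` is hypothetical, the use of `readxl::read_excel` and the sheet layout are assumptions, and the toy data frame contains made-up numbers that only illustrate the effect of scaling.

```r
# Sketch only: helper name and toy values are assumptions, not the report's code.
# Keep the numeric columns of a data frame and standardize them (mean 0, sd 1).
prepare_numeric <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  scale(num)
}

# Hypothetical loading step (needs the readxl package and the file on disk):
# baseball <- readxl::read_excel("baseball-2016.xlsx")
# print("Dimensions of the dataset:")
# print(dim(baseball))

# Toy stand-in with made-up numbers, mimicking two variables on different scales.
toy <- data.frame(
  Team   = c("A", "B", "C", "D"),
  League = c("AL", "AL", "NL", "NL"),
  AB     = c(5500, 5600, 5450, 5520),
  AVG    = c(0.250, 0.262, 0.245, 0.258)
)
toy_scaled <- prepare_numeric(toy)

print(round(colMeans(toy_scaled), 10))  # column means after scaling
print(apply(toy_scaled, 2, sd))         # column standard deviations after scaling
```

Without scaling, the Euclidean distance between two teams in the toy data would be driven almost entirely by `AB`, since its differences are several orders of magnitude larger than those of `AVG`.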

## [1] "Dimensions of the dataset:"
## [1] 30 28
  2. Write R code that performs a non-metric MDS with Minkowski distance=2 of the data (numerical columns) into two dimensions. Visualize the resulting observations in Plotly as a scatter plot in which observations are colored by League. Does there seem to be a difference between the leagues according to the plot? Which of the MDS components seems to provide the best differentiation between the Leagues? Which baseball teams seem to be outliers?

There seems to be a difference between the leagues, but it is not very clear. Most of the AL teams (66.6%) lie on the positive side of V2, while the NL teams are more spread out along this axis. So there may be some differences between the two leagues, but they are not pronounced.

As stated above, V2 is the component that seems to differentiate best between the leagues.
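
One way such a non-metric MDS could be run is sketched here with `MASS::isoMDS`, whose trace output has the same form as the log in this section. The helper `run_nmds`, its `standardize` argument, and the toy data (random numbers with hypothetical column names) are assumptions for illustration only. The Plotly part is guarded so the sketch still runs when the plotly package is not installed.

```r
library(MASS)  # provides isoMDS()

# Sketch only: helper name and arguments are assumptions.
# Non-metric MDS of the numeric columns of `df` into k dimensions.
# The Minkowski distance with p = 2 is the Euclidean distance.
run_nmds <- function(df, k = 2, standardize = TRUE) {
  num <- as.matrix(df[, sapply(df, is.numeric), drop = FALSE])
  if (standardize) num <- scale(num)
  d <- dist(num, method = "minkowski", p = 2)
  fit <- isoMDS(d, k = k, trace = FALSE)
  coords <- as.data.frame(fit$points)
  names(coords) <- paste0("V", seq_len(k))
  list(dist = d, coords = coords, stress = fit$stress)
}

# Toy stand-in: random numbers, not the baseball data.
set.seed(17)
toy <- data.frame(
  Team   = paste0("T", 1:10),
  League = rep(c("AL", "NL"), each = 5),
  var_a  = rnorm(10), var_b = rnorm(10),
  var_c  = rnorm(10), var_d = rnorm(10)
)
res <- run_nmds(toy)
mds_df <- cbind(toy[, c("Team", "League")], res$coords)

# Scatter plot colored by League; hovering shows the team name.
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(mds_df, x = ~V1, y = ~V2, color = ~League,
                       text = ~Team, type = "scatter", mode = "markers")
  # print(p) displays the plot in an interactive session
}
```

Hovering over isolated points in the resulting plot is a simple way to read off the names of candidate outliers.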

## initial  value 19.856833 
## iter   5 value 16.319153
## iter  10 value 16.046215
## final  value 15.935476 
## converged
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels


The teams that seem to be outliers are:

  3. Use Plotly to create a Shepard plot for the MDS performed and comment on how successful the MDS was. Which observation pairs were hard for the MDS to map successfully?

The MDS performance looks average, considering that a perfect encoding of the dissimilarities would yield a one-to-one relationship between the original dissimilarities and the distances in the MDS configuration. A few observation pairs that look like outliers were quite difficult for the MDS algorithm to map. Some of those points are presented below.
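
A sketch of how a Shepard plot and the hardest pairs could be obtained is given here. It uses random toy data rather than the baseball data, and ranking pairs by their absolute deviation from a monotone (isotonic) fit is one possible criterion chosen for illustration, not necessarily the one used in this report. The variable names are hypothetical.

```r
library(MASS)  # provides isoMDS()

# Toy stand-in: random numbers, not the baseball data.
set.seed(1)
x <- scale(matrix(rnorm(12 * 5), nrow = 12,
                  dimnames = list(paste0("T", 1:12), NULL)))
d <- dist(x, method = "minkowski", p = 2)
fit <- isoMDS(d, k = 2, trace = FALSE)

delta <- as.vector(d)                 # original dissimilarities
dhat  <- as.vector(dist(fit$points))  # distances in the 2-D configuration

# Monotone (isotonic) regression of configuration distances on dissimilarities.
o <- order(delta)
fitted <- numeric(length(delta))
fitted[o] <- isoreg(delta[o], dhat[o])$yf

# Map every entry of the distance vector back to its pair of observations.
n <- attr(d, "Size")
idx <- which(lower.tri(matrix(0, n, n)), arr.ind = TRUE)
labs <- attr(d, "Labels")
pairs <- data.frame(
  obs1 = labs[idx[, "col"]], obs2 = labs[idx[, "row"]],
  delta = delta, dhat = dhat, fitted = fitted,
  residual = abs(dhat - fitted)
)

# Pairs that deviate most from the monotone fit were hardest to map.
hardest <- head(pairs[order(-pairs$residual), ], 5)
print(hardest)

# Shepard plot: points are observation pairs, the line is the monotone fit.
if (requireNamespace("plotly", quietly = TRUE)) {
  pairs_sorted <- pairs[order(pairs$delta), ]
  p <- plotly::plot_ly(pairs_sorted, x = ~delta, y = ~dhat,
                       text = ~paste(obs1, obs2, sep = " - "),
                       type = "scatter", mode = "markers", name = "pairs")
  p <- plotly::add_lines(p, y = ~fitted, name = "monotone fit")
  # print(p) displays the plot in an interactive session
}
```

`MASS::Shepard(d, fit$points)` returns the same three ingredients (dissimilarities, configuration distances, and fitted values) but sorted by dissimilarity, so the manual version above is used here to keep track of which pair each point belongs to.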

  4. Produce a series of scatterplots in which you plot the MDS variable that was the best in the differentiation between the leagues in step 2 against all other numerical variables of the data. Pick two scatterplots that seem to show the strongest (positive or negative) connection between the variables and include them in your report. Find some information about these variables in Google - do they appear to be important in scoring the baseball teams? Provide some interpretation for the chosen MDS variable.

The two variables that show the strongest connection with V2, and therefore best separate the two leagues, are SH and IBB. Both variables count the number of times a certain play is made: SH stands for sacrifice hits and IBB for intentional bases on balls. Both are abbreviations for offensive plays, which could suggest that the variable V2 obtained from the MDS is related in some way to defensive plays. The direction in which defense increases along V2 will depend on whether the NL is more defensive than the AL.
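
The selection of the two strongest scatterplots can be supported numerically. The sketch here ranks variables by their absolute Pearson correlation with the chosen MDS component. The helper `rank_by_correlation` and the toy variables `var_a`, `var_b`, `var_c` are hypothetical, the data are random numbers, and using correlation as the measure of "connection" is an assumption.

```r
# Sketch only: helper name and toy data are assumptions.
# Rank the columns of `num_df` by absolute correlation with `component`.
rank_by_correlation <- function(num_df, component) {
  r <- sapply(num_df, function(v) cor(v, component))
  r[order(-abs(r))]
}

# Toy stand-in: var_a and var_b are built to track the component, var_c is noise.
set.seed(2)
V2 <- rnorm(30)
toy <- data.frame(
  var_a = 2 * V2 + rnorm(30, sd = 0.1),
  var_b = -V2 + rnorm(30, sd = 0.2),
  var_c = rnorm(30)
)
ranked <- rank_by_correlation(toy, V2)
top_two <- names(ranked)[1:2]
print(ranked)

# One scatterplot per selected variable, with the MDS component on the y axis.
if (requireNamespace("plotly", quietly = TRUE)) {
  plots <- lapply(top_two, function(v) {
    p <- plotly::plot_ly(x = toy[[v]], y = V2,
                         type = "scatter", mode = "markers")
    plotly::layout(p, xaxis = list(title = v), yaxis = list(title = "V2"))
  })
  # print(plots[[1]]) displays the first plot in an interactive session
}
```

Replacing `top_two` with `names(toy)` in the `lapply` call would produce the full series of scatterplots, one for each numerical variable.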